The basics of ggplot2

Author

Rosana Zenil-Ferguson and Jeremy Van Cleve

Published

March 6, 2025

notes to self

have 2-3 plot with these features

  • plot with text labels that need adjusting
  • unedited axes labels
  • points with different colors
  • background color
  • hline for editing

show them how to save as pdf

send above list to jenna

Workshop materials

Workshop materials can be found at https://github.com/vancleve/data2design_uk.

Download R and RStudio

If you haven’t already downloaded and installed R and RStudio, please do so now.

R

Rstudio

https://posit.co/download/rstudio-desktop/

Getting a Project going in RStudio

RStudio cheatsheet

Create a new project in RStudio

Using a “project” allows us to keep all our files in one place and helps RStudio understand which files belong to which project

  1. Go to File->New Project->New directory->New project
  2. Select an easy to find directory
  3. Name your project something descriptive like “data2design_ggplot_tutorial”.
  4. Save this qmd file into your project directory.
    Go to https://github.com/vancleve/data2design_uk/raw/refs/heads/main/ggplot_tutorial.qmd and Save As... in your browser.

Packages

Packages are groups of different functions or “actions” that allows us to do special tasks. People write packages to help others with reproducibility of analyses.

We will be working with the plotting package ggplot2 and a set of packages for data wrangling called tidyverse (https://www.tidyverse.org/). By installing just tidyverse, we can get all those packages at once.

install.packages("tidyverse") # packages for data wrangling and plotting including dplyr, readr, tidyr, stringr, readxl, ggplot2, and others

Once installed, packages must be loaded into our “environment” (it is like bringing a pen if you are planning to write, not only buying it!). To load them we use the function library()

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

Reading data

The function read_excel from the package readxl allows us to read excel files and read_csv allows us to read csv (comma separated value) files. We will first load a dataset from Onstein et al. (2019)1 frugivory‐related traits and dispersal in the custard apple family. In the code below, we’ll load the data directly from Dryad (https://datadryad.org/dataset/doi:10.5061/dryad.2hd8b0s), which is a repository for archiving datasets, by first downloading the xlsx file and then loading it with read_excel. We’ll also simplify some of the column names too for clarity and turn some of the columns from log values into normal values.

tf = tempfile(fileext = ".xlsx")
curl::curl_download("https://datadryad.org/downloads/file_stream/82494", tf)

fruits <- 
  read_excel(tf, sheet = "Matrix for analysis", na = "NA") |>
  rename(Taxon_tree = Species_tree, Taxon = Species_PROTEUS) |>
  mutate(across(starts_with("Log_"), exp, .names = "Exp_{col}")) |>
  rename_with(\(x) gsub("Exp_Log_", "", x, fixed = TRUE))

## Check what is inside the dataset, was it read fine?
fruits
# A tibble: 228 × 27
   Taxon_tree                   Taxon   Log_Fruit_length_avg Log_Fruit_width_avg
   <chr>                        <chr>                  <dbl>               <dbl>
 1 Alphonsea_boniana_JB57       Alphon…               NA                 NA     
 2 Alphonsea_elliptica_JB34     Alphon…                0.763              0.580 
 3 Alphonsea_kinabaluensis_JB85 Alphon…                0.415              0.279 
 4 Ambavia_gerrardii_PB         Ambavi…                0.301              0.114 
 5 Anaxagorea_javanica_JB       Anaxag…                0.544              0.544 
 6 Anaxagorea_luzonensis_JB     Anaxag…                0.415             -0.222 
 7 Anaxagorea_phaeocarpa_0498   Anaxag…                0.505              0.301 
 8 Anaxagorea_silvatica_0113    Anaxag…                0.176              0.0414
 9 Annickia_chlorantha_0976     Annick…                0.114             -0.155 
10 Annickia_kummeriae_MWC7004   Annick…                0.342              0.0414
# ℹ 218 more rows
# ℹ 23 more variables: Log_Fruit_number_avg <dbl>, Log_Stipe_length_avg <dbl>,
#   Log_Seed_length_avg <dbl>, Log_Seed_width_avg <dbl>,
#   Log_Seed_number_avg <dbl>, Log_Height_max <dbl>, Bright <chr>,
#   Pubescent <chr>, Defence <chr>, Fruit_type <chr>, Cauliflory <chr>,
#   Moniliform <chr>, Dehiscence <chr>, Shrub <chr>, Conspicuousness <chr>,
#   Fruit_length_avg <dbl>, Fruit_width_avg <dbl>, Fruit_number_avg <dbl>, …

Plotting and understanding our data

Basics of ggplot2: bar plots

The ggplot2 package is built on the idea that graphics have can have a “grammar”, or set or rules, that specifies how they can and should be constructed. Implementing these rules not only makes creating graphics easier, but it makes such graphics consistent and clear. Hadley Wickham, the creator of ggplot2, borrows this idea from the book, “The Grammar of Graphics”” by Wilkinson, Anand, and Grossman (2005)2. While this structure may seem a bit artificial at first, it makes creating graphics very modular and building up complex graphics much easier.

Our first plot will be one of the simplest, which is a bar plot. Let’s plot the number of species by their Fruit_type.

ggplot(fruits) +
  geom_bar(aes(x = Fruit_type))

The above ggplot command has two pieces. The first is a call to ggplot with the name of the data table. This command by itself creates a blank canvas onto which we can plot using data from the data table. The second piece is to “add” a layer to this canvas with +, and that layer is a bar plot. The geom part is short for geometry since all the graphical elements of different plot types are made up of geometric pieces. In the argument to geom_bar, we give an “aesthetic mapping” with aes(x = Fruit_type), which says we want the x-axis to map to the Fruit_type column of our data. Finally, geom_bar builds bars automatically whose height is given by the number of rows in the data table with each x value.

This plot is a bit boring but we can spice it up with another aesthetic. Let’s try the color of the bar, which is known as the “fill”. We’ll set the fill to the Cauliflory variable.

ggplot(fruits) +
  geom_bar(aes(x = Fruit_type, fill = Cauliflory))

ggplot is smart here and automatically adds a legend for the color aethetic since you otherwise wouldn’t know which color mapped to which value of Dehiscence. In fact, ggplot also automatically added the tick labels for the same reason for the x-axis aesthetic.

Grids of plots

Looking at our dataset, there are a few other fruit variables we can look at like Dehiscence and Moniliform. What if we wanted to see if how the number of each Fruit_type varies as a function of different values of Dehiscence and Moniliform? One way to visualize this would be to replicate the bar plot above for each combination of Dehiscence and Moniliform. It turns out that ggplot let’s us do this very easily by adding a facet_wrap layer onto our bar plot.

ggplot(fruits) +
  geom_bar(aes(x = Fruit_type, fill = Cauliflory)) +
  facet_wrap(vars(Dehiscence, Moniliform))

What can we observe from this?

Notice that the y-axis scales are all the same. This is on purpose and allows us to compare the data across the panels of the plot. However, we miss the variation in the panels with fewer observations. To allow the y-axis scales to adjust in each panel, we can add scales = "free_y" to facet_wrap.

ggplot(fruits) +
  geom_bar(aes(x = Fruit_type, fill = Cauliflory)) +
  facet_wrap(vars(Dehiscence, Moniliform), scales = "free_y")

Scatter plots

When do we use scatter plots? Scatter plots are great for looking at how variables are correlated to one another and for seeing the full distribution of the data since you typically plot every single point.

Let’s make our first basic scatter plot. Notice that we put the aes command as an argument to the ggplot command. This works just as well as putting it in the geom_point command except that we can add on additional geometries without having to repeat the aes command.

ggplot(fruits, aes(x=Fruit_length_avg, y=Fruit_width_avg)) + 
  geom_point()
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).

What do we notice?

Now let’s add a regression (linear model) line through the points.

ggplot(fruits, aes(x = Fruit_length_avg, y = Fruit_width_avg)) + 
  geom_point() +
  geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).

The geom_smooth function is what adds the linear regression line. Note here we have to add method="lm" to tell geom_smooth to add a straight line (i.e., linear); otherwise, it defaults to a more complex method that fits a smooth curve.

ggplot(fruits, aes(x = Fruit_length_avg, y = Fruit_width_avg)) + 
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).

By adding a color aesthetic, the points are colored based on the value of the variable we map to color. Here, let’s map Shrub to color and keep the linear regression.

ggplot(fruits, aes(x = Fruit_length_avg, y = Fruit_width_avg, color = Shrub)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE, fullrange = TRUE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).

Are groups different?

Volcano plots

Scatter plots can be used to visualize all kinds of data and are especially population in genomics data. For example, differential expression analyses using RNA-seq are very common and a “volcano plot” is often used to show which genes have increased and which have decreased expression in an experiment. We’ll use a dataset common in RNA-seq differential expression tutorials3, airway, which comes from an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone4. The data from airway can be loaded from the workshop GitHub site.

airway_de = read_csv("https://raw.githubusercontent.com/vancleve/data2design_uk/refs/heads/main/airway_de_transcript.csv")
Rows: 15926 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): .feature, albut, gene_id, gene_name, gene_biotype, seq_name, symbol
dbl (8): gene_seq_start, gene_seq_end, seq_strand, logFC, logCPM, F, PValue,...
lgl (4): entrezid, seq_coord_system, GRangesList, .abundant

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airway_de
# A tibble: 15,926 × 19
   .feature        albut gene_id  gene_name entrezid gene_biotype gene_seq_start
   <chr>           <chr> <chr>    <chr>     <lgl>    <chr>                 <dbl>
 1 ENSG00000000003 untrt ENSG000… TSPAN6    NA       protein_cod…       99883667
 2 ENSG00000000419 untrt ENSG000… DPM1      NA       protein_cod…       49551404
 3 ENSG00000000457 untrt ENSG000… SCYL3     NA       protein_cod…      169818772
 4 ENSG00000000460 untrt ENSG000… C1orf112  NA       protein_cod…      169631245
 5 ENSG00000000971 untrt ENSG000… CFH       NA       protein_cod…      196621008
 6 ENSG00000001036 untrt ENSG000… FUCA2     NA       protein_cod…      143815948
 7 ENSG00000001084 untrt ENSG000… GCLC      NA       protein_cod…       53362139
 8 ENSG00000001167 untrt ENSG000… NFYA      NA       protein_cod…       41040684
 9 ENSG00000001460 untrt ENSG000… STPG1     NA       protein_cod…       24683489
10 ENSG00000001461 untrt ENSG000… NIPAL3    NA       protein_cod…       24742284
# ℹ 15,916 more rows
# ℹ 12 more variables: gene_seq_end <dbl>, seq_name <chr>, seq_strand <dbl>,
#   seq_coord_system <lgl>, symbol <chr>, GRangesList <lgl>, .abundant <lgl>,
#   logFC <dbl>, logCPM <dbl>, F <dbl>, PValue <dbl>, FDR <dbl>

Each row of the table is gene whose expression was measured in the experiment, logFC is the log2 of the ratio of the expression in the treatment vs the control, PValue is the significance of the comparison, and FDR is false discovery rate.

Plotting the volcano plot is as simple as plotting the logFC on the x-axis and the -log10(Pvalue) on the y-axis. We can color the points by those with a FDR < 0.05.

ggplot(airway_de, aes(x = logFC, y = -log10(PValue), colour = FDR < 0.05)) +
    geom_point()

What if we want to color points as well by those with a greater than 2 fold change in expression up or down? Then we can do a little wrangling to create a new signficance column and use that for the color.

airway_de_sig = airway_de |>
  mutate(significance =
           case_when(
             FDR < 0.05 & abs(logFC) >= 2 ~ "significant FDR and FC",
             FDR < 0.05 & abs(logFC) < 2 ~ "significant FDR",
             FDR > 0.05 & abs(logFC) >= 2 ~ "significant FC",
             .default = "non-significant"
           ))


ggplot(airway_de_sig, aes(x = logFC, y = -log10(PValue), colour = significance)) +
  geom_point()

Finally, what about labeling some of the genes with the most significant (i.e., lowest) p-values? To do this, we need a little more data wrangling and a new geometry called geom_text that places a label on top of a data point.

airway_de_sig_lbl = 
  airway_de_sig |>
  mutate(label = ifelse(PValue < 5e-9, gene_name, ""))
  
plt = ggplot(airway_de_sig_lbl, aes(x = logFC, y = -log10(PValue))) +
  geom_point(aes(colour = significance)) +
  geom_text(aes(label = label))
plt

This looks great except the labels are right on top of the points and they overlap. We can fix this in Illustrator! Let’s save the plot to a PDF so we can load it into illustrator later.

ggsave("airway_de_plot.pdf", plt, width = 8, height = 5)

Plotting distributions

Often we are interested in understanding how often observations of a certain size appear in our data. For example, how many species have fruit of a certain length? To answer questions like this, we need to visualize the distribution of the data.

Questions about distributions often come up when our variables are contrinous or metric, which means that we can add and subtract values of the variable and create summaries of the variable like its mean value.

Histograms

One of the most common ways of visualizing a distribution is with a histogram, which groups observations into bins (or cajitas) to show us which kinds of observations are more frequent.

Let’s make a basic histogram!

ggplot(fruits, aes(x=Fruit_length_avg)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).

Let’s interpret it!

Making histograms pretty and more useful

Changing colors in and out and saving it

p_histogram<- ggplot(fruits, aes(x=Fruit_length_avg))+ geom_histogram(color="darkblue", fill="hotpink")
p_histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).

What if we want to separate the data by the type of plant? and we have a variable Shrub in the dataset that tells us which one is a shrub or not a shrub

p_histogram<-ggplot(fruits, aes(x=Fruit_length_avg, color=Shrub)) + 
  geom_histogram(fill="white")
p_histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).

What if we want to add the mean of each group?

## Calculating the mean per group using Dplyr
## we are naming the mean of each group mean_fl (as a short for mean fruit length)
mu<-fruits %>% group_by(Shrub) %>% 
  summarize(mean_fl=mean(Fruit_length_avg, na.rm=TRUE),
            .groups = 'drop')
mu
# A tibble: 3 × 2
  Shrub     mean_fl
  <chr>       <dbl>
1 not_shrub    1.62
2 shrub        1.56
3 <NA>       NaN   

The little symbol %>% are called pipes in this package. Piping is an important tool for reproducibility. You never change your original data, you only reorder internally and put through a pipe line

Ok, so what is happening with the dimension of mu? what do you notice?

dim(mu)
[1] 3 2
### (there are NAs this will be annoying later on, so we are going to remove)
mu<-mu[2:3,] ## discuss here how R reads tables and matrices as rows and columns!!!!
# another option mu<- mu[-1,]

Okay so now finally add the mean to the groups and the histogram

p_histogram<-ggplot(fruits, aes(x=Fruit_length_avg, color=Shrub)) +
  geom_histogram(fill="white", position="dodge")+
  geom_vline(data=mu, aes(xintercept=mean_fl, color=Shrub),
             linetype="dashed")+
  theme(legend.position="top")
p_histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_vline()`).

What if I want my own colors? Let’s check some cool options (aka coolors)

p_histogram<- p_histogram+scale_color_manual(values=c("hotpink", "#56B4E9")) #EXPLAIN
p_histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_vline()`).

But I want to make it publication quality! Let’s make nice labels on the axes and clean the background

p_histogram<- p_histogram+xlab("Average Fruit Length")
p_histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_vline()`).

p_histogram<-p_histogram+theme_classic()
p_histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 15 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_vline()`).

Density plots

What are density plots?- They are areas under the curve and later they will help us to decide about probability!

Let’s make a simple one!

p_density<-ggplot(fruits, aes(x=Fruit_length_avg)) + geom_density()
p_density
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_density()`).

Make it pretty colors

p_density<-ggplot(fruits, aes(x=Fruit_length_avg))+
  geom_density(color="darkblue", fill="lightblue")
p_density
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_density()`).

Remember those pesky NAs what happens in density plots if we remove them. For that we will use the function subset().

p_density<-ggplot(data=subset(fruits,!is.na(Fruit_length_avg)), aes(x=Fruit_length_avg)) + geom_density()
p_density

We will add the mean and remember to remove NAs (do one with and one without NAs)

mean.fruit=mean(fruits$Fruit_length_avg,na.rm=TRUE)
mean.fruit
[1] 1.596013
p_density<-p_density+ geom_vline(aes(xintercept=mean.fruit),
              color="blue", linetype="dashed", size=1)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
p_density

Change density plot line colors by groups

p_density<-ggplot(fruits, aes(x=Fruit_length_avg, color=Shrub)) +
  geom_density() + geom_vline(data=mu, aes(xintercept=mean_fl, color=Shrub),
             linetype="dashed")
p_density
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_density()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_vline()`).

# Fill them in 
p_density<-ggplot(fruits, aes(x=Fruit_length_avg, fill=Shrub)) + geom_density(alpha=0.4)+ geom_vline(data=mu, aes(xintercept=mean_fl, color=Shrub),  linetype="dashed")
p_density
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_density()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_vline()`).

Adding my color scheme (tolk about coolors here)

p_density<-p_density+scale_fill_manual(values=c("hotpink", "#56B4E9"))
p_density
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_density()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_vline()`).

Making it publication style

p_density<- p_density+ xlab("Average Fruit Length")+theme_classic()
p_density
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_density()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_vline()`).

Histogram with density plot

p_histdensity<-ggplot(fruits, aes(x=Fruit_length_avg)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="hotpink") 
p_histdensity
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_density()`).

Making Box and Whisker plots

What are box and whisker plots? What are quartiles and interquartile difference??

ggplot(data=fruits, aes(x=Fruit_length_avg)) + 
  geom_boxplot()
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_boxplot()`).

How do we make it vertical???

p_boxplot<-ggplot(data=fruits, aes(y=Fruit_length_avg)) + 
  geom_boxplot()
p_boxplot
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_boxplot()`).

How do we make it box and whisker plots by group??

p_boxplot<-ggplot(fruits, aes(y=Fruit_length_avg, color=Shrub)) +
  geom_boxplot()
p_boxplot
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_boxplot()`).

How do we change the color?

p_boxplot<-ggplot(fruits, aes(x=Shrub,y=Fruit_length_avg, color=Shrub)) +
  geom_boxplot()+scale_color_manual(values=c("hotpink", "#56B4E9"))
p_boxplot
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_boxplot()`).

How do we fill them in

p2<-ggplot(fruits, aes(x=Shrub,y=Fruit_length_avg, fill=Shrub)) +
  geom_boxplot()

p2+scale_fill_manual(values=c("hotpink", "#56B4E9"))
Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Eliminating those pesky NAs from plots

p_boxplot<-ggplot(data=subset(fruits, !is.na(Fruit_length_avg)), aes(x=Shrub,y=Fruit_length_avg, color=Shrub)) +
  geom_boxplot()+scale_color_manual(values=c("hotpink", "#56B4E9"))
p_boxplot

How do we add the mean?

p_boxplot<-p_boxplot + geom_point(data=mu,aes(x=Shrub,y=mean_fl),shape=23, size=4)
p_boxplot
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Adding good labels

p_boxplot<-p_boxplot+xlab("Shrub status")+ylab("Average fruit length")
p_boxplot
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

How do we add all the sample?

p_boxplot<-p_boxplot + geom_jitter(shape=16, position=position_jitter(0.2))
p_boxplot
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

All pretty and ready for publication

p_boxplot<-p_boxplot+theme_classic()
p_boxplot
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Let’s load the libraries and the dataset

library(airway)
library(tidybulk)

# load airway RNA sequencing data
data(airway)

# convert to tidybulk tibble
tbl_airway <-
    airway |>
    tidybulk()
tbl_filtered = tbl_airway |>
  mutate(.sample=str_remove(.sample, "SRR1039")) |>
  mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys = as.character(.feature), keytype = "ENSEMBL", column="SYMBOL", multiVals = "first")) |>
  keep_abundant(factor_of_interest=dex)
'select()' returned 1:many mapping between keys and columns
tbl_de = tbl_filtered |>
   test_differential_abundance(
      .formula = ~ 0 + dex + cell,
      contrasts = c("dextrt - dexuntrt"),
      omit_contrast_in_colnames = TRUE
    )
=====================================
tidybulk says: All testing methods use raw counts, irrespective of if scale_abundance
or adjust_abundance have been calculated. Therefore, it is essential to add covariates
such as batch effects (if applicable) in the formula.
=====================================
This message is displayed once per session.
tidybulk says: to access the raw results (fitted GLM) do `attr(..., "internals")$edgeR`
This message is displayed once per session.
tbl_transcript = tbl_de |> pivot_transcript()

tbl_transcript |>
  ggplot(aes(x = logFC, y = PValue, colour = FDR < 0.05)) +
  geom_point() +
  scale_y_continuous(trans = "log10_reverse")

tbl_transcript |> write_csv("airway_de_transcript.csv")

Footnotes

  1. Onstein, R. E., W. D. Kissling, L. W. Chatrou, T. L. P. Couvreur, H. Morlon, and H. Sauquet. 2019. Which frugivory-related traits facilitated historical long-distance dispersal in the custard apple family (Annonaceae)? Journal of Biogeography 46:1874–1888.↩︎

  2. Wilkinson, L. 2005. The grammar of graphics. Statistics and computing. Springer New York.↩︎

  3. https://stemangiola.github.io/rpharma2020_tidytranscriptomics/articles/tidytranscriptomics.html↩︎

  4. Himes, B. E., X. Jiang, P. Wagner, R. Hu, Q. Wang, B. Klanderman, R. M. Whitaker, et al. 2014. RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells. PLOS ONE 9:e99625.↩︎